A comparative study of social network models: network evolution models and 

nodal attribute models 



Riitta Toivonen*'^, Lauri Kovanen^, Mikko Kivela'^, Jukka-Pekka Onnela''''^'^, Jari Saramaki^, Kimmo 

KaskP 

^Department of Biomedical Engineering and Computational Science (BECS), Helsinki University of Technology, P.O. Box 

9203, FIN-02015 HUT, Finland 
^Physics Department, Clarendon Laboratory, Oxford University, Oxford OXl 3PU, United Kingdom 
'^Sa%d Business School, Oxford University, Oxford 0X1 IHP, United Kingdom 



Abstract 

This paper reviews, classifies and compares recent models for social networks that have mainly been published 
within the physics-oriented complex networks literature. The models fall into two categories: those in 
which the addition of new links is dependent on the (typically local) network structure {network evolution 
models, NEMs), and those in which links are generated based only on nodal attributes {nodal attribute 
models, NAMs). An exponential random graph model (ERGM) with structural dependencies is included for 
comparison. We fit models from each of these categories to two empirical acquaintance networks with respect 
to basic network properties. We compare higher order structures in the resulting networks with those in the 
data, with the aim of determining which models produce the most realistic network structure with respect to 
degree distributions, assortativity, clustering spectra, geodesic path distributions, and community structure 
(subgroups with dense internal connections). We find that the nodal attribute models successfully produce 
assortative networks and very clear community structure. However, they generate unrealistic clustering 
spectra and peaked degree distributions that do not match empirical data on large social networks. On the 
other hand, many of the network evolution models produce degree distributions and clustering spectra that 
agree more closely with data. They also generate assortative networks and community structure, although 
often not to the same extent as in the data. The ERGM model turns out to produce the weakest community 
structure. 
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1. Introduction 

Modeling social networks serves at least two pur- 
poses. Firstly, it helps us understand how social 
networks form and evolve. Secondly, in studying 
network-dependent social processes by simulation, 
such as diffusion or retrieval of information, suc- 
cessful network models can be used to specify the 
structure of interaction. A large variety of models 
have been presented in the physics-oriented com- 
plex networks literature in recent years, to explore 
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how local mechanisms of network formation pro- 
duce global network structure. In this paper we 
review, classify and compare such models. 

The models are classified into two main cate- 
gories: those in which the addition of new links is 
dependent on the local network structure {network 
evolution models, NEMs), and those in which the 
probability of each link existing depends only on 
nodal attributes {nodal attribute models, NAMs). 
NEMs can be further subdivided into growing mod- 
els, in which nodes and links are added until the 
network contains the desired number N of nodes, 
and dynamical models, in which the steps for adding 
and removing ties on a fixed set of nodes are re- 
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peated until the structure of the network no longer 
statistically changes. For completeness, we in- 
clude in our comparative study two models from 
the tradition of exponential random graph models 
(ERGMs). One of them is based solely on nodal 
attributes, and the other incorporates structural 
dependencies. All of these models produce undi- 
rected networks without multiple links or self- links, 
and all networks are treated as unweighted, i.e. tie 
strengths are not taken into account. We note that 
some of the models were designed with a particu- 
lar property in mind, such as a high average clus- 
tering coefficient, but we will assess their ability 
to reproduce several of the typical features of so- 
cial networks. In addition to comparing the distri- 
butions of degree and geodesic path lengths and 
clustering spectra, we assess the presence or ab- 
sence of communities, which in the complex net- 
works literature are typically defined as groups of 
nodes that are more densely connected to nodes in 
the sam e community than to node s in ot her com- 
munities Fortunato and Castellanj 1 20081 ). 

This paper is structured as follows. In Sections 
ll.ll to fOl we define the categories of network evolu- 
tion models and nodal attribute models, and briefly 
review exponential random graph models. Section 
11.41 discusses differences between the philosophies 
behind NEMs and ERGMs. We fit models from 
each of these categories to two empirical acquain- 
tance networks with respect to basic network statis- 
tics. The fitting procedure is discussed in Sec- 
tion [3] and Appendix lA.2l In Section IH we compare 
higher order structures in the resulting networks 
with those in the data. Section \E\ summarizes our 
results. 

1.1. Network evolution models (NEMs) 

Let us first present a class of network models that 
focuses on network evolution mechanisms. These 
models test hypotheses that specific network evolu- 
tion mechanisms lead to specific network structure. 
We call these network evolution models (NEMs), 
and define them via three properties as follows: 

1) A single network realization G is produced by an 
iterative process that always starts from an ini- 
tial network configuration G{to) specified in the 
NEM. Dynamical models often begin with an 
empty network, and growing models start with 
a small seed network^. 



2) The specifications of the NEM include an ex- 
plicitly defined set of stochastic rules by which 
the network structure evolves in time. These 
rules concern selecting a subset of nodes and 
links at each time step, and adding and delet- 
ing nodes and links within this subset. The 
rules typically correspond to abstracted mech- 
anisms of social tie forma tion such as triadic 
I973I ). i.e. tie formation 



closure ( Granovetter 



based on the tendency of two friends of an indi- 
vidual to become acquainted. The rules always 
depend on network structure and they can some- 
times also incorporate nodal attributes. The 
rules determine the possible transitions from one 
network G{tk-i) to the next G{tk) during the 
iterative process that will produce one network 
realization G = G{tend)- 

3) The NEM includes a stopping criterion: 

a) For a growing NEM, the algorithm finishes 
when the network has reached a predeter- 
mined size. The typical assumption is that 
relevant statistical properties of the network 
remain invariant once the network is large 
enough. 

b) For a dynamical NEM, the algorithm finishes 
when selected network statistics no longer 
vary^ . 



A growing model can be motivated as a model 
for social networks in several contexts. For exam- 
ple, on social networking sites people rarely remove 
links, and new users keep joining t he network. Sim - 
ilarly, in a co-authorship network NewmanI ( 2001 ') 
derived from pubHcation records, existing links re- 
main while new links form. We point out that the 
growing models do not intend to simulate the evo- 
lution of a social network ab initio. However, the 
mechanisms are selected to imitate the way people 
might join an already established social network. 



^The seed network does not always need to be exactly 



specified, as long as it meets the given general criterion (such 
as being small compared to the network that will be gener- 
ated), as it typically has a negligible efi'ect on the resulting 
network. 

^While the stopping criterion for a growing NEM is ex- 
act, requirement 3b) is a heuristic criterion that assumes that 
the algorithm will reach a stage at which the selected sta- 
tistical properties of the networks G'(t) stabilize. Although 
we cannot know with absolute certainty whether stationary 
distributions have been reached, we can be relatively confi- 
dent of it if monitored properties remain constant and their 
distributions appear stable for a large number of time steps. 



The NEMs in our comparative study include only 
network structure based evolution rules (that may 
depend on topology and tie strengths), although 
nodal attribute based rules are also possible. Mod- 
els in which link generation is based solely on (fixed) 
nodal attributes belong to the category of nodal at- 
tribute models (NAMs) discussed below. 

1.2. Nodal attribute models (NAMs) 

We adopt the term nodal attribute models 
(NAMs) for network models in which the probabil- 
ity of edge Cij between nodes i and j being present 
is explicitly stated as a function of the attributes of 
the nodes i and j only, and the evolutionary aspect 
is absent. NA Ms are often based on t he concept 



of homophily (jMcPherson et al.l . l200lf ). the ten 



dency for like to interact with like, which is known 
to structure network ties of various types, includ- 
ing friendship, work, marriage, information trans- 
fer, and other forms of relationship. Such models 
hav e also been de s cribed by the tern i spati al mod- 
els l|Boguna et all . l2004l : IWong et all l2006h . refer- 
ring to that the fact that the attributes of each node 
determine its 'location' in a social or geographical 
space. 

1.3. Exponential random graph models (ERGMs) 
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CERGMsl llFrank and Straussl. [iQSgI : iFrankl. Il991 
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Robins et al 



Robins et al- 
2007bll . 



Wasserman and Pattisonl. 
20Q7al : ISnijders et all 12005 
also called p* models, are used to test to what 
extent nodal attributes (exogenous factors) and 
local structural dependencies (endogenous factors) 
explain th e obser ved global structure. For example, 
Goodreaul l|2007l ) used ERGMs to infer that much 
of the global structure (measured in terms of the 
distributions of degree, edgewise shared partners 
and geodesic paths) observed in a friendship 
network could be captured by nodal attributes and 
patterns of shared partners and fc-triangles, which 
are relatively local structures. 

Consider a random graph X consisting of N 
nodes, in which a possible tie between two nodes 
i and j is represented by a random variable Xij, 
and denote the set of all such graphs by X. Using 
this notation, ERGMs are defined by the probabil- 
ity distribution of such graphs X 



cxp {9* u(a::)} 



where is the vector of model parameters, u(a;) 
is a vector of network statistics based on the net- 
work reaHzation x, and the denominator c{9, X) is 
a normalization function that ensures that the dis- 
tribution sums up to one. The selected statistics 
u(a;) specify a particular ERGM model. Typically, 
the parameters 9 of an ERGM model are deter- 
mined using a maximum likelihood (ML) estimate, 
obtained b y Markov Chain Mont e Carlo (MCMC) 
samp ling I Geyer and Thompsonl . Il992l : iSnijdersl . 
2002) • MCMC sampling heuristics are also used 
to draw network realizations from the distribu- 
tion Pg^x- Several software packages are designed 
for fitting and simulating ERGMs (in cluding pnet, 
SIENA , and statnet, discussed by iRobins et al.l 

(HqqiE)). 



(1) 



1.4. Differences between NEMs and ERGMs 

An important difference between network evolu- 
tion models and exponential random graph models 
is that a NEM is determined by the rules of net- 
work evolution, whereas ERGMs do not explicitly 
address network evolution processes. The particu- 
lar update steps employed in the iterative MCMC 
procedure for drawing samples are not explicitly 
specified in ERGMs, which are defined by the prob- 
ability distribution Pe^x-, although MCMC methods 
can also be used to model the evolution of social 
networks ( Snijderj . 19961 . 2001). A class of proba- 
bility models that includes network evolution is the 
stochastic a ctor-orien t ed mo dels for network change 
proposed by Sniiders 1 1996[ ). which are continuous- 
time Markov chain models that are implemented 
as simulation models. Another difference is that 
unlike ERGMs, NEMs explicitly specify an initial 
configuration from which the iteration is started, 
as well as a stopping criterion. However, NEMs are 
typically not sensitive to the initial configuration. 

One of the known problems with ERGMs is 
that the distributions of their suffici ent statistics 
may be multimodal ( Snijderd . 2002[ ). This has 
been of particular concern with respect to ERGMs 
that include statistics related to transitivity, which 
is a highly relevant feature in modeling social 
networks. The first stochasti c model to express 
trans itivity, the Markov graph I Frank and Strau"si . 
1986h . employed a simple triangle count term that 
is k nown to cause problems of model degener- 
acy ( JonassonL Il999f ). and to lead to instability in 
simulation of large networks with Markov Chain 
Monte Carlo (MCMC) me t hods ()Snijdersl . I2OO2I : 
Handcockl . l2003l : iGoodreaul . l2Q07t ). This problem 



seems to have been largely overcome with a re- 
cently proposed term related to triangles, the geo- 
metrically weighted edgewise shared partne rs statis- 



tic f GWESP) (ISniiders et al.. . 2006 ; Hunter et al 
20081 : iRobins et al.l . l2007bf l. We include in our com- 



parison an ERGM that includes the GWESP term. 
It turns out that we encounter instability even with 
this model. In fitting this model to our data, in the 
optimal parameter region a very small modification 
of the model parameters produces a large difference 
in the resulting network structure. This is discussed 
in Section [3] and Appendix lA. 21 

In contrast, transitivity is easy to incorporate in 
NEMs. Problems of multimodality have not been 
observed with NEMs. Although we do not always 
have theoretical certainty that the network evolu- 
tion rules could not lead to multimodal distribu- 
tions of network statistics, in practice the models 
with given parameters seem to consistently produce 
network realizations with similar statistics. 

The NEMs and ERGMs lend themselves to test- 
ing different kinds of hypotheses about networks. 
ERGMs can be employed to test to what ex- 
tent nodal attributes and local structural correla- 
tions explain the global structure. Although both 
NEMs and ERGMs can easily incorporate nodal at- 
tributes, they have rarely been included in NEMs. 
The NEMs proposed so far have been of a fairly 
generic nature, whereas the ERGM approach often 
aims to make inferences based on specific empiri- 
cal data, often including nodal attributes. On the 
other hand, NEMs can be employed for testing hy- 
potheses about network evolution, which ERGMs 
do not explicitly address. For example, a NEM 
can be used to test whether a combination of 
tie-strength-dependent triadic closure and global 
conn ections can produce a clearly clustered struc- 
ture (|Kumpula et al.l . l2007l) . Although ERGMs can 
also be interpreted as addressing endogenous (net- 
work structure based) selection processes via struc- 
tural dependencies, the mechanisms by which new 
ties are created based on the existing network struc- 
ture are made explicit only in NEMs. 

For the dynamical NEMs treated in this pa- 
per, it is easy to generate (and estimate parame- 
ters for) networks of 10 000 nodes or more. The 
growing models can easily produce networks with 
millions of nodes. Based on our hands-on experi- 
ence u^m£_state;of4he2art ERGM software (stat- 



net, 



Handcock et al.1 ()2003l . l2007t )). it seems that 



drawing a sample from an ERGM with structural 
dependencies. In generating network realizations 
from an ERGM, we used as a guideline that the 
number of MCMC steps, corresponding to the num- 
ber of proposals for changes in the link configura- 
tion, should be large enough such that the presence 
or absence of a link between each dyad is likely to 
be changed several times. With this approach, the 
number of MCMC steps should be proportional to 
the number of dyads, implying that the complexity 
is at least on the order of 0{N'^). This is already 
a much larger burden than the 0{N) complexity of 
NEMs based on local operations in the neighbor- 
hood of a selected node. Our assumption of the 
computational demands of ERGMs is supported by 
the fact that networks that have thus far been stud- 
ied with ERGMs have consisted typically of at mo st 
a couple of thousands of nodes ijGoodread . l2007r ) . 
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Figure 1: Categories of social network models. Within the 
category of NEMs, we focus on models based on triadic clo- 
sure and global connections ( TCG). Model labels correspond 
to models discussed in Section [J] 



2. Description of the models 

Many complex networks models study the ques- 
tion of whether structures observed in social net- 
works could be explained by the network-dependent 
interactions of nodes, without reference to intrin- 
sic properties of nodes. Such models are based on 
assumptions about the local mechanisms of tie for- 
mation, such as people meeting friends of friends, 
and thus forming connec tions with their ne twork 
neighbors (triadic closure I Granovetter . 1973h ). An 
additional mechanism to produce 'global' connec- 
tions beyond the local neighborhood is typically in- 
cluded t o account for sh ort average geodesic path 



generating a realization from a NEM might typ- 
ically have much lower computational cost than 



lengths (|Milgraml . 119671 ). Such connections may 
arise from encounters at common hobbies, places 
of work, etc. In models that do not consider 
nodal attributes, contacts between any dyads in the 
network are considered equally likely. These two 
mechanisms, triadic closure and global connections 



(TCG), form the basis of all the NEMs we study in 
this work. 

Tables [H [2] and [3] contain more detailed descrip- 
tions of the models and their parameters, with fixed 
parameters given in parentheses. Values of the fixed 
parameters were selected according to the original 
authors' choices wherever possible. We label the 
models using author initials. 



Dynamical network evolution models. We will first 
look at three dynamical models that combine tri- 
adic closure and global connections {TCG) for 
creating new l inks. These were proposed by 
iDavidsen etldl (l2002h rOEB). iMarsiU et all ^M) 
(MVS). and lKumnnla et aP (KOSKK). The 

different ways of implementing triadic closure and 
deletion of links in each of these models are high- 
lighted in Fig. m In triadic closure mechanism Tl, 
a node is introduced to another node by their com- 
mon neighbor. In mechanism T2, new contacts are 
made through search via friends: A node links to a 
neighbor of one of its neighbors. Dynamical models 
in which new links are continuously added must also 
include a mechanism for removing links, to avoid 
ending up with a fully connected network. In node 
deletion (ND), all links of a node are deleted. This 
emulates a node 'leaving' and a newcomer joining 
the network. In link deletion (LD), each link has a 
given probability of being deleted at each time step. 

The DEB model is the simplest of the three, 
with only two parameters, network size N and the 
probability p of deleting a node. The MVS and 
KOSKK models both use triadic closure mecha- 
nism T2, a two-step search in the neighborhood of 
a node, but the KOSKK model takes interaction 
strength into account. In KOSKK, new links are 
created preferably through strong ties, and every 
interaction further strengthens them. This mech- 
anisr a is able to produce clear community struc- 



ture (jKumpula et al.l . 120071 ) , confirmed by our anal- 



ysis in Section [H The three models also differ in 
whether a new node can remain isolated for several 
time steps (as in the MVS model) or will immedi- 
ately link to another node (as in KOSKK), and in 
whether there is a limit on the number of random 
connections each node can make (as in DEB). Be- 
cause of such differences, it is difficult to isolate the 
effects of the choices of Tl versus T2 and ND versus 
LD. Therefore, in Section l4?5l we will combine the 
fo ur mechanisms usin g the DEB model as a basis. 

iMarsili et al. ( 2004 ) did not mention which value 
they used in the MVS model for the probability 



A of deleting a link at each time step. We fixed 
A = 0.001 in our simulations, giving each tie an av- 
erage 'lifetime' of 1000 time steps. When generating 
network realizations, the dynamical models MVS, 
DEB, and KOSKK are iterated until monitored dis- 
tributions appear to become stationary. Sometimes 
the authors do not state which particular criterion 
they used. For the MVS and DEB models, we de- 
termined how many iterations (the steps described 
in Table [U it takes until average degree stabilizes 
and its distribution appears stationary. When gen- 
erating networks, we used a number of iterations 
above this limit. For the KOSKK model, we used a 
number of iterations determined by the authors to 
be sufficient for the distributions of degree and sev- 
eral other network properties to appear stationary 
(2.5 X lO'' X N, where N is network size, resulting in 
2 X 10® and 2.8 x 10^ for fitting to our data sets of 
sizes 8003 and 1133 presented in Section [3?2l 
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Figure 2: The dynamical network evolution models DEB, 
KOSKK, and MVS, classified according to the mechanisms 
for triadic closure and link deletion employed in them. 



Growing network evolution models. We inclu de two 
gro wing models, proposed bv lVazquezI (|2003h (Vaz) 
and lToivonen et al.1 liooi ) (TOSHK). They are de- 
scribed in detail in TablejH These are to our knowl- 
edge the only growing models specifically proposed 
for social acquaintance networks. The motivation 
behind the Vaz model is to produce a high level of 
clustering and a power law degree distribution. The 
TOSHK model also aims at a broad degree distribu- 
tion and a high clustering coefficient, but also sets 
out to reproduce other features observed in social 
networks, such as community structure. 

In TOSHK, each new node links to one or more 
'initial contacts', which in turn introduce the new- 
comer to some of their neighbors. In Vaz, a new- 
comer node first links to a random node i, creating 
potential edges (Vazquez's term) between itself and 
the neighbors of i. These ties may be realized later, 
generating triangles in the network. In both mod- 
els, triangles are only generated between the new- 



Table 1: Category: Dynamical network evolution models (dynamical NEMs). Three models based on triadic closure and global 
connections. 



Parameters Mechanisms. Number of nodes A'^ fixed; repeat steps for I) adding ties and II) deleting ties 

until stationary distributions are reached 

DEB fPavidsen et al. . 20021 

I) Select a node i randomly, and 

2 free a) if i has fewer than two ties, introduce it to a random node 

N p b) otherwise pick two neighbors of i and introduce them if they are not already acquainted. 

II) Select a random node and with prob. p remove all of its ties. 

MVS fMarsili et al. . 20041 

I) Select a node i randomly, and 

3 free a) connect i to another random node with probability rj. 

jy ^ ^ b) select a /riend's /riend of i (by uniformly random search) with probability 5 and introduce 

^^_Q ggj^-j i to it if not already acquainted. 

II) Select a random tie and delete it with probability A. 

KOSKK fKumpula et al. . 20071 

I) Select a node i randomly, and 

3 free a) select a friend's friend k (by weighted search) and introduce it to i with prob. pa (with 

-'V, PA, Pr initial tie strength mq) if not already acquainted. Increase tie strengths by 5 along the search 

{wo = 1, path, as well as on the link Uk if it was already present. 

p^ = 0.001, b) additionally, with prob. pr (or with prob. 1 if i has no connections), connect i to a random 

S = 0.5) node j (with tie strength wq). 

II) Select a random node and with prob. pd remove all of its ties. 

Nodes represent individuals and links represent ties between them. Parameters whose values were fixed according 
to the original authors' choices are shown in parentheses. 



comer and the neighbors of its initial contact, and 
further processes of introduction are ignored. As 
with all the models, we keep to the authors' choices 
presented in the original paper. Accordingly, in the 
TOSHK model, we allow a newcomer to link to at 
most two initial contacts (see Table [l]) , and pick 
the number of secondary contacts from the uniform 
distribution J7[0, fc], although this clearly Hmits the 
adaptability of the model. 

Nodal attribute models. We study two nodal at- 
tribute models that differ in the dependence 
of link probability on distance and in the em- 
plo yed distance measur e. These mode ls, proposed 
bv iBogufia et all l|2004l l (BPDA) and IWong et al ' 



iaOOfil) (WPR), are described in Table H The au- 



thors mention that a social space of any dimension 
could be used, but study the cases of ID and 2D, 
respectively. We keep to their choices. 

ERGM with structural dependencies. As our data 
does not contain nodal attributes, we can only in- 
clude structural terms in the exponential random 
graph model labeled ERGMl (Table S]). The term 



edge count is an obvious choice to include, in order 
to match average degree. We must also include a 
term related to triads, considering the prevalence 
of transitivity social networks. We employ the geo- 
metrically weighted edgewise shared partner statis- 



tic (GWESP), pr oposed by IS niiders elap (|2006l l 



and formulated by iHunter etal. (2008i) as 

n-2 

v{x;T)^e^J2^1-(l-e^y}EP,{x), (2) 



where the edgewise shared partners statistic EPi{x) 
indicates the number of unordered pairs {j, k} such 
that Xju = 1 and ?' and k have exactly i com- 
mon neighbors l|Huntei] . [2007l ). The simple trian- 
gle count term employed in Markov random graphs 
is known to cause problems of multimodality, and 
we are not aware of other triangle-related terms 
that would have been employed in ERGMs. Be- 
cause we would also like to match the degree dis- 
tribution to the data, we incl ude the geometricall 
weighted degree term (GWD) (jSnijders et al.l . l20Q 



Table 2: Category: Growing network evolution models (growing NEMs). Two models based on triadic closure and global 
connections. 



Parameters Mechanism. Repeat steps for I) adding nodes and ties II) adding ties only until network 

contains A'' nodes. 

TQSHK fToivonen et al. . 20061 

g j,^^^ I) Add a new node i to the network, connecting it to one random initial contact with proba- 

^ ^ bility p, or two with probability 1 — p- 

( ■' ^ Vfi rll ^'^'^ each random initial contact j, draw a number msec of secondary connections from the 

distribution U\Q,k] and connect i to msec neighbors of j if available. 

Vaz fVazauez. 20031 

I) With probability 1 — it, add a new node to the network, connecting it to a random node 
i. Potential edges are created between the newcomer n and the neighbors j of i (a potential 
edge means that n and j have a common neighbor, i, but no direct link between them). 
' ^ II) With probability m, convert one of such potential edges generated on any previous time 

step to an edge. Potential edges generated by converting an edge are ignored. 



Table 3: Category: Nodal attribute models (NAMs). 

Parameters Mechanism 
BPDA fBoguna et al. . 20041 

Distribute A'' nodes with uniform probability in a (1-dimensional) social space (a segment 

3 free of length hmax)- Link nodes with prob. p = 1/ (1 + (d/b)"), where d is their distance in the 
N, a, b social space, {hmax can be absorbed within b). If treated many-dimensionally, similarity 

along one of the social dimensions is sufficient for the nodes to be seen as similar. 

WPR fWong et al. . 20061 

Distribute A'' nodes according to a homogeneous Poisson point process in a (2-dimensional) 

4 free social space of unit area. Create a link between each node pair separated by distance d with 
A'', H, p, pb probability p + pb if d < H, and with probability p — pa if d > H (where PA{p,Pb, H) is such 

that the total fraction p of all possible links is generated). 



Hunter et ai1 . l2008h ^ 

n-2 

^z(x;T)=e-5]{l-(l-eT}A(x), (3) 



1=1 



where Di indicates the number of nodes with 
de gree i. We fix the parameter r = 0.25 as 
in IjGoodreaul . 120071 ) . We generate network reaUza- 



tions using the statnet software ijHandcock et al.l . 
20071). MCMC iterations are started from an 



Erdos-Renyi (BernoulU) network with average de- 
gree matching the target. We draw 5 reaUzations 
from each MCMC chain at intervals of 10'', using 
a burn-in of 5 x 10'' time steps. Model parameters 
are optimized consistently for all models with the 
procedure described in Section[3]and Appendix lA.2l 



' ^Goodreaul ll2007l ) observed that the model 
edges+covariates+GWESP explains much of the observed 
data (an adolescent friendship network with 1681 actors) 
and that no improvement is achieved by including the terms 
geometrically weighted degree (GWD) or geometrically 
weighted dyadwise shared partners statistic (GWDSP). 
Based on this, it seems that the terms GWD and GWDSP 
might not bring additional value to a model that already 
inc ludes the GWESP term. However, the conclusions drawn 
bv lGoodreaul \2mt \ might not be transferable to our case 
because our data is different; for example, we do not have 
nodal attribute data. 



3. Fitting the models 

In order to compare networks generated by dif- 
ferent models, it is necessary to unify some of their 
properties. To this end, we fit the models to two 
real- world data sets with respect to as many of the 
most relevant network features as the model pa- 
rameters allow. Our fitting method consists of sim- 
ulating network realizations with different model 
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Table 4: Category: exponential random graph models (ERGM) with structural dependencies. 



Parameters Definition 
ERGMl fSiiiiders et al. . 20061 

The model is defined with three terms: edge count L, geometrically weighted edge-wise 
4 free shared partners (GWESP) v{x; r) (Eq.[2]), and geometrically weighted degree (GWD) u{x; r) 

A*', Ol^ (Eq. [3]), as the probability distribution 

Ogwesp^ 

OawD p^ ^ {X = x) - "^""P ^^^^ ^ Ogwespv{x) + Bawpuix)} 

(r = 0.25) c(e,X) 



parameters, and finding the parameter values that 
produce the best match to selected statistics. 

3.1. Targeted features for fitting 

The most important properties that we wish to 
align between the models and the data are the num- 
ber of nodes and links. Because both of our data 
sets are connected components of a larger network, 
we focus on the properties of the largest connected 
component of the generated networks. Our first 
two fitting targets are largest connected component 
size Nlc and the average number of Hnks per node, 
or average degree fc, within the largest component. 
They are already sufficient for fitting the DEB and 
Vaz models, which have only two parameters. A 
natural choice for the next target is some measure 
related to triangles, because they are highly preva- 
lent in social networks. We will use the average 
clustering coefficient c (please see Appendix lA.ll for 
the definition), which is a well-established charac- 
terization of local triangle density in the complex 
networks literature. All of the network evolution 
models in this study had as one of their aims ob- 
taining a high clustering coefficient. These three 
features are sufficient for fitting the rest of the mod- 
els except WPR, if we fix some of the parameters 
according to the original authors' choices (please 
see Table [J). 

If matching Nlc, k and c is not enough to fix 
all parameters of the model, we no longer have a 
straightforward choice. We considered using the as- 
sortativity coefficient and geodesic path lengths (see 
lA.ip . In the WPR model, assortativity varies 
closely together with the average clustering coeffi- 
cient, so it could not be used as a fourth target fea- 
ture. Instead, we used the average geodesic path 
length. We also attempted using the assortativ- 
ity coefficient for fitting the KOSKK model, allow- 



ing the weight increment parameter 5 to vary, but 
ran into a different problem: attempting high as- 
sortativity forced the weight increment parameter 
to zero, thereby eliminating an important feature of 
the weighted model and weakening the community 
structure. Hence, we fixed S = 0.5 in accordance 
with the authors' choice. 

All of these measures - degree, high clustering, 
assortativity, and geodesic path lengths - assess im- 
portant properties of social networks, which are 
likely to affect dynamics such as opinion fo r mation 
or spreading o f information llOnnela et al" , 2007a ; 
Moreno et all . l2004t [Castellano et all . l2007l) . The 
average properties can typically be tuned by vary- 
ing parameter values, but the general shapes of the 
distributions are likely to be invariable. 



3.2. The friendship network at www. last. fm and the 
email network 

We selected two social network data sets with 
sHghtly different average properties, in order to get 
a better picture of the adaptability of the models. 
They differ in average degree, average clustering co- 
efficient, and the assortativity coefficient, although 
both display assortativity and high clustering. 

We collected a mutual friendship network of 
users of the web service last.fm. At the web site 
www.last.fm, people can share their musical tastes 
and designate other users as their friends. We used 
for this study only the friendship information, disre- 
garding the musical preferences. Because there are 
several hundred thousand users on the site world- 
wide, we selected users in one country, Finland, 
to obtain a smaller network with 8003 individuals. 
The country labels were self-reported. This data 
set (henceforth called lastfm) represents the largest 
connected component of Finnish users at this site. 
Individuals in the resulting network have on the 



average k — 4.2 friends, and a high clustering coef- 
ficient c = 0.31. The network is highly assortative 
with r = 0.22, indicating that friends of those users 
who have many connections at the site are them- 
selves well connected (please see Appendix I A. II for 
definitions). After designating someone as a friend, 
there is no cost to maintaining the tie, i.e. the link 
never expires. This means that the data may over- 
estimate the number of active friendships within 
the last-fm web site. However, the degree distribu- 
tion is not broader than that observed in a network 



constr ucted from mobile phone calls IjOnnela et al 



20G7bh . in which each contact has a real cost in 
time and money. Requiring ties to be reciprocated 
ensures that the users have at least both acknowl- 
edged one another. 

We also use a smaller acquain tance network col- 
lected by Guimera et al. 1 20031 ). based on emails 
between members of the University Rovira i Vir- 
gin (Tarragona). In the derived network, two in- 
dividuals are connected if each sent at least one 
email to the other during the study period, and 
bulk emails sent to more than 50 recipients are 
eliminated. Again, we use the largest connected 
component of the network. It consists of 1133 in- 
dividuals, and it is a compact network with aver- 
age geodesic path length / = 3.6, average degree 
k = 9.6, fairly high average clustering coefficient c 
= 0.22, and fairly small assortativity r = 0.08. 

Both of our empirical networks are unweighted, 
meaning that tie strengths are not specified. All of 
the models studied here apart from KOSKK are 
unweighted as well. Averaged basic statistics of 
both data sets are displayed in Table [6l The degree 
distributions, clustering spectra and degree-degree 
correlations of the lastfm and email networks are 
shown in Fig. [3l and more plots of their statistics 
are shown in Section [4] in connection with the fitted 
models. 

Table [5] indicates which features were targeted 
when optimizing the parameters of each model, and 
displays the optimized parameters. Table[6] displays 
properties of the networks generated with these pa- 
rameters. Due to the stochastic nature of the mod- 
els, two network realizations generated with the 
same parameters are not likely to have exactly the 
same average properties. The plots and tables con- 
cerning the model networks in this paper always 
contain values averaged over 100 network realiza- 
tions. 

Fitting to a limited number of data sets does 
not allow full assessment of the adaptability of the 
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Figure .3: Properties of the lastfm data set (•) and the email 
data (O). a) degree distrib utions, with average degrees k 
= 4.2 and 9.6, respectively. iGuimera et al fitted to 

the email data an exponential distribution p{k) = e~*'^^ 
with k* = 9.2, which shows as a straight line in a semilog- 
arithmic plot. The lognormal distribution fitted the lastfm 
data best of the different distributions we tried (exponential, 
WeibuU, gamma, and lognormal), although not perfectly, b) 
Clustering c(fc) decreases with degree k (average clustering 
c = 0.31 and 0.22, respectively), c) Degree-degree correla- 
tions between nodes and their neighbors (fc„„ signifies av- 
erage nearest neighbor degree) show that both networks are 
assortative (with r = 0.22 and r = 0.08, respectively). 



models. However, the features that we examine 
are similar in our two data sets as in other large 
scale empirical social networks, such as those based 



on communication via mobile phone IjOnnela et al 
2007bl : ISeshadri et"aLl. I2OO8II and M icrosoft Mes- 
senger I Leskovec and Horvitd . 2008h . For exam- 
ple, these networks have skewed degree distribu- 
tions that imply the presence of high degree nodes, 
high average clustering coefficients c, decreasing 
clustering spectra c(fc), and positive degree-degree- 
correlations r. A detailed description of the fitting 
procedure is included in Appendix lA. 21 

3.3. Adaptability of the models 

Not surprisingly, for almost all models, average 
largest component size Nlc and average degrees k 
could be fitted closely to both data sets. For the 
models with only two free parameters (DEB, Vaz), 
we had no control over other network features. 
These two-parameter models turn out to have ex- 
cessively high average clustering coefficients for the 
moderate average degrees displayed in our two data 
sets. For most of the other models, clustering could 
be tuned rather closely. The TOSHK model, with 
its discrete parametrization of the number of tri- 
angles formed, was not able to exactly match the 
clustering values despite having three parameters. 

For the model ERGMl, we allowed the average 
degree to remain slightly below the target in order 
to obtain correct clustering, because aiming at both 
correct average degree and clustering led to an in- 
stable region of model parameters. We initially at- 
tempted using automated optimization algorithms 



(such as snobfit i Huver and Neumaierl . l2008[ )) to fit 
the ERGMl model, but these failed due to the in- 
stability. Based on the intuition of the model pa- 
rameters obtained from the attempts at fitting, we 
initially selected values that roughly produced the 
desired N^c, k, and c, and manually modified them 
for a better fit. Starting from parameter values that 
generated networks in which the clustering coeffi- 
cient matched the email data and the average de- 
gree was only sHghtly too small, it turned out that 
a very small increase in the parameter 0l (done 
in order to increase average degree) caused average 
degree to jump dramatically and the clustering co- 
efficient to plummet (see Fig. [13] in Appendix I A. 2p . 
Hence, we settled for a lower value of k. 

Average geodesic path lengths I were approxi- 
mately correct for all but the nodal attribute model 
treated in one dimension (BPDA), although T was 
used for fitting only in the WPR model. The as- 
sortativity coefficient r was not used for fitting any 
model, although we attempted using it for fitting 
WPR and ERGMl. The ERGMl model was only 
fitted to the email data, because generating net- 
works of size 8000 and fitting their parameters did 
not seem feasible for the ERG model. 



Table 5: Targeted 
parameters leading 
email data sets. 



network features, and the fitted model 
to the values closest to the lastfm and 



DEB 


matched to Nj^fy, 
lastfm: N = 8330, 
email: N = 1138, 


k 

P = 
P = 


0.203 
0.064 


MVS 


matched to Nj^r;, 
lastfm: N = 9300, 
email: N = 2270, 


k , 
€ = 


0.0022, 71 = 0.000368 
0.0062, Tj = 0.000071 


KOSKK 


matched to Nj^r- , 
lastfm: N = 8205, 
email: N = 1135, 


fc , 
PA 
PA 


= 0.0029, pt- = 0.0008 
= 0.0107, pr = 0.0039 


TOSHK 


matched to iV, fc , 
lastfm: N = 8003, 
email: N = 1133, 


P = 
P = 


0.60, fc = 1 
0.06, fc = 3 


Vaz 


matched to N, k 
lastfm: N = 8003, 
email: N = 1133, 




0.524 
0.793 


ERGMl 


matched to Nj^n, ^ , 
lastfm: - 

email: N = 1160, = 
^GWn = 0-225 


= -6.962, Oq^eST = 2-4, 


BPDA 


matched to Nj^^q, 
lastfm: N = 8250, 
email: N = 1133, 


fc , 


1.915, fa = 1.51 - 10"'* 
1.565, fa = 0.002032 


WPR 


matched to N^^r;, 
lastfm: N = 8200, 
email: N = 1133, 


fc , 
H ^ 

H = 


c , r 

0.0108, p ^ 0.000506, p^ = 0.9994 
= 0.040, p 0.008498, p^ = 0.991 



"LC- 

degree, c: a 
fc, c, and I 



clustering coefRcient, I: a^ 
Iculated for the largest co 



f nodes), fc; average 
shortest path length, 
nt of the network. 



4. Comparison of higher order statistics 

Having fitted the models according to average 
values of particular network characteristics, we ad- 
dress their degree distributions P{k), clustering 
spectra c(fc), and geodesic path length distributions 
P{1). We also assess the community structure of the 
networks using several measures. In Section f45l we 
combine and compare the different mechanisms for 
triadic closure and Hnk deletion employed in the dy- 
namical NEMs. We use graphs to assess goodness 
of fit as promoted by Hunter et al.l 1 20081 ) . 



4-1- Degree distribution 

Degree distributions are shown in Fig. |4] for the 
email data and selected models. The exact shapes 
of the degree distributions produced by the mod- 
els are not as important as their markedly different 
probabilities for the presence of high degree nodes 
(Fig. 13]). The nodal attribute models, of which 
the lastfm fit of WPR is shown, produce skewed 
but fast-decaying degree distributions that imply 
the absence of nodes with very high degree. These 
distributions are well fit with the Poisson di s tribu - 
tion", as shown analytically bv lBoguna et al. ( 2004 ) 
for the BPDA model. The Vaz model produces 
a very broad d egree dis t ributi on (not shown) that 
was shown by Vazquez 1 2003f ) to decay as power 
law, P{k) ~ fc"''', which impHes the presence of a 
few nodes with extremely high degree. The tails of 
the degree distributions produced by the dynami- 
cal NEMs and the growing TOSHK model as well 
as the ERGMl model all appear to decay slower 
than the Poisson distribution, but faster than power 
law. Of these, the models TOSHK, KOSKK, and 
ERGMl are displayed in Fig.H 

In our data sets, the d egree distribu tion de- 



cays exponentially (email) ijGuimera et al.l . l2003l ) 
or slower (lastfm) (Fig.|3|. In larger data sets based 
on one-to-one communication, ev en broader degree 
distributions have been obs e rved llLambiotte et al 



20081 : lOnnela et all . l2007bl : ISeshadri et al.1 . l2008h 



The NEMs give rise to degree distributions that 



^The homophily principle does not always lead to a Pois- 
son degree distribution. The shape of the degree distribu- 
tion depends on how the nodal attributes are distributed. 
iMasuda and Konnd ll2006l ) used an exponentially distributed 
fitness parameter as the basis for homophily, and obtained 
a flat degree distribution P(k)=const. As they observe, 
this is unrealistic. Combined with another mechanism, 
homophily can also lead to a broader degree distribution 
l|Masuda and Konnol . l2006l ). 
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Table 6: Basic statistics of the lastfm and email data sets and the models fitted to each. 
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All statistics are calculated for the largest component of each network. Nlc'- Largest component size, L: number 
of links, k: average degree, c: average clustering coefficient, r: assortativity coefficient, /: average geodesic path 
length, and Imax'- longest geodesic path length. The values are averaged over 100 realizations of each network 
model. The standard error of the averages is displayed whenever there was fluctuation in the values. 



match these empirical data on large acquaintance 
networks better than the nodal attribute models. 




1 5 10 15 20 25 30 35 1 5 10 15 20 25 30 35 



Figure 4: Degree distributions P{k) of the email data (solid 
line) and in selected models fitted to it. The box plots dis- 
play medians and first and third quartiles in 100 network 
realizations. Whiskers extend from each end of the box to 
the most extreme values in the data within 1.5 times the 
interquartile range from the ends of the box. Outliers are 
denoted by +. 



4-2. Clustering spectrum 

Many network models display roughly an in- 
verse relation between node degree and clustering^ : 



^This follows naturally in any model where an increase 
in the number of links of a node goes hand in hand with an 



c{k) ~ This holds true also for most of the 
NEMs studied here, of which TOSHK, KOSKK, 
and DEB are shown in Fig. [3 as well as for 
the ERGMl model (not shown). The figures dis- 
play fits to lastfm data, but results are similar for 
the email fits. In contrast, the homophily mech- 
anism on which the nodal attribute models are 
based is seen to produce a flat clustering spec- 
trum cik) = const, shown in Fig. [5] for the lastfm 
flt of the WPR model. In all empirical network 
data that we have come across, including both of 
our data sets (Fig. [5|) as well as acquaintance net- 
works based_on_^Iessenger a nd mobile phone calls 
(e.g. lOnnela et all (2007b), Leskovec and Horvitz 
( 2008h ). clustering c(k) decreases with increasing 
degree A: of a node. This indicates that attribute 
based homophily alone does not seem to explain ob- 
ser ved network structur e s, sup portin g the findings 
by Masuda and Konnol I 2006h and Hunter et al 
(|2008h . 



4-3. Geodesic paths 

Apart from the nodal attribute model treated 
one-dimensionally (BPDA), in which average 
geodesic path lengths are strikingly long compared 



increase in the number of triangles around it. If on average 
increasing the degree A; of a node by one is accompanied by 
an increase of the number A^a of triangles around the node 
by a, the resulting clustering coefficient for a node of degree k 
will be on average c(fe) — — 
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Figure 5: Clustering spectrum c(fc) in the lastfm data (solid 
line) and in models fitted to it. Averaged over 100 network 
realizations. 



to the data, all networks display reasonable path 
length distributions (Fig. [6]). The dynamical NEMs 
and the TOSHK model are slightly too compact, 
with largest path lengths falling below those in the 
data. The Vaz model, surprisingly, has rather long 
geodesic paths despite its broad degree distribu- 
tion. Generally, high degree nodes decrease path 
lengths across the network, but the high assorta- 
tivity of the Vaz networks seems to counter the 
effect. For reference, even in an extremely large 
acquaintan ce network of several n i illion individuals 
worldwide ( Leskovec and Horvitzl . l2008h , the aver- 
age distance between two individuals is 6.6, and 
path lengths up to 29 have been found. 



4.4- Community structure 

Cliques. Finally, we assess the community struc- 
ture of the networks. Perhaps the simplest possible 
measure of community structure is the number of 
cliques (Fig. [8^), or fully connected subgraphs, of 
different sizes. Figure [7| displays the average num- 
bers of cHques in the model networks. Because each 
network has roughly an equal number of nodes and 
links, the different numbers of cliques are due to 
the arrangement of links within the network and 
not to differences in global link density. It turns 
out that the NAMs produce clique size distributions 
that match the data sets fairly well in both fits. The 
WPR model, fitted to the email data, is shown in 
Fig. [71 The KOSKK and DEB models also pro- 
duce distributions roughly comparable to the em- 
pirical data, and the Vaz model in fact produces 
far too many large cliques when link density is high 



Figure 6: Distributions P(i) of geodesic path lengths I in 
models fitted to (a) the lastfm data, and (b) the email data. 
The data is shown as a thick black line in each panel. Aver- 
aged over 100 network realizations. 



(Fig. III). The MVS and TOSHK models have trou- 
ble producing large enough cliques when link den- 
sity is low (the lastfm fits). A possible explanation 
of why the MVS model produces very few cliques is 
indicated by the comparison of Section 14.51 where 
node deletion is seen to preserve more cliques than 
link deletion. The parametrization of the TOSHK 
model, requiring that the number of secondary con- 
tacts be drawn from a uniform distribution, severely 
limits the number of coincident triangles and hence 
cliques which can be formed. The ERGMl model 
produces the fewest cliques of all the models. 



Figure 7: Number n{k) of cliques of size k in the model 
networks fitted to the email data, shown as a solid line in 
each panel for reference. Cliques within larger cliques, such 
as triangles within a 4-clique, are not counted. Averaged 
over 100 network realizations. 
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k- clusters. We also identify communities us- 
ing the__fc2c/ig'Me2£ercoZaiion method developed 



by iPalla et al 



l|2005h . The method defines a k- 
cluster as a subgraph within which all nodes can 
be reached by 'rolling' a /c-clique such that all ex- 
cept one of its nodes are fixed (see Fig. [HJd). Fig- 
ure [9] displays the size distributions of /c-clusters 
with k = 4 and k = 5 for several models fitted to the 
email data. As the ERGMl model produced very 
few cliques apart from triangles, it cannot generate 
large /c-clusters for fc > 3. The other models gen- 
erally produce 4-cluster size distributions roughly 
matching the data, but large 5-clusters are rela- 
tively few. The Vaz model generates networks con- 
taining very large fc-clusters with high values of k. 
These are likely due to an extremely dense 'core' 
formed around nodes that joined the network early 
on. For example, each of the 100 network realiza- 
tions contained 10-cluster of size s = 72 ± 15 (not 
shown). Such dense clusters are not generally ob- 
served in empirical data. For example, in the lastfm 
and email data sets, the largest 10-clusters are of 
sizes 10 and 12, respectively. 



(a) 



S-clique 4-dique 5-clique 



A 



P') 4-cluster 




Figure 8: (a) fc-cliques for k = 3, 4, 5. (b) An example of a 
4-cluster with 6 nodes, highlighting the 4-cliques from which 
it is formed. 



Role of links with low overlap. In both of our em- 
pirical networks, as well as in the networks gen- 
erated by the studied models, a rather large frac- 
tion of edges does not participate in any triangles. 
In the lastfm and email data, the fraction of such 
edges is 31.2% and 22.4% respectively^ The DEB, 
TOSHK, Vaz, and ERGMl models produce slightly 
too few such links (20 to 22% in the the lastfm fits 
and 4 to 5% in the email fits, except 12,6% in the 



°This might be due to the nature of our empirical data 
sets, which are sampled from networks that are constantly 
growing with links and nodes accumulating over time. In 
them, a relatively large fraction of nodes are newcomers who 
have only established a few links to the system, such that 
triangles have not yet been formed around them. 
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Figure 9: Average number n{s) of fc-clusters of size s in a net- 
work, for fc = 4 (O) and fc = 5 {^), in the email data, and in 
models fitted to it. Averaged over 100 network realizations. 



email fit of ERGMl), whereas the nodal attribute 
models and KOSKK tend to generate slightly too 
many of them (35 to 40% in the the lastfm fits and 
27 to 41% in the email fits). 

We can ask what structural role is played by links 
that do not participate in triangles, or more gener- 
ally, by links whose end nodes share only a small 
fraction of their neighbors. Within a community, 
adjacent nodes tend to share many neighbors, while 
for edges between communities, the neighborhoods 
of the end nodes will not overlap much. This can 
be q uantified using a me asure called the overlap 
Oij ( Onnela et al. . 2007b[ ). which could be inter- 
preted as a modifi cation o f the edgewise shared 
partners measure ijHunten . l2007n . measuring the 
fraction instead of the number of edgewise shared 
partners for the end nodes of an edge. The mea- 
sure a lso bears resemblance to the Jaccard coeffi- 
cient ( Jaccardl . 1901 ). The overlap is defined as 
Oj-i = Tz — TTTTTT — in „ , where TLii is the number 



of neighbors common to both nodes i and j, and k 
and kj are their degrees (Fig. [T0|) . 




0,1 = 



Oij = 3/5 



Oi] = / 



Figure 10: Overlap Oij of edge e^j. 

Removing low-overlap-links will separate dense, 
loosely interconnected communities from one an- 
other. This turns out to discern the nodal attribute 
models and the KOSKK model from the other mod- 
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els and our empirical data. Figure HDJa) displays 
the relative sizes of the largest component after re- 
moving links that do not participate in triangles 
for the lastfm data and the models fitted to it. The 
nodal attribute models break down to small clus- 
ters, whereas in the other models a large core re- 
mains. 

As noted earlier, the NAMs contain more zero- 
overlap links than the other models. Hence, it is 
useful to check whether their breakdown was due 
to a larger fraction of removed links. We can test 
this by removing an equal fraction of links from 
all networks (41%, the maximum fraction of links 
removed from any network when only non-triangle- 
links were removed) (Fig. [Tib). We remove links 
in increasing order of overlap Oy . Again, a core 
remains intact in most of the NEMs, whereas the 
NAMs and the KOSKK network break down, indi- 
cating in these models the absence of a core, and the 
role of low overlap links as bridges between clusters. 

The link densities of the remaining components, 
d = 2 l/s{s — 1), where s is the number of nodes in 
the component and / the number of links, are more- 
over observed to be slightly higher in the NAMs 
than in the other models, despite the fact that more 
links were removed from them (not shown). The 
above findings show that these networks consist of 
very clear communities that are loosely intercon- 
nected. The other NEMs and ERGMl on the other 
hand contain a core that does not consist of such 
loosely connected clusters. This difference is de- 
picted schematically in Fig. [TiTc.d). 

In the email fits, link density in the network is 
higher, and for all networks slightly larger overlap 
links need to be removed in order to decompose 
them to small clusters (not shown), but the general 
difference between the NEMs and NAMs remains. 
As the ERGMl model was only fitted to the email 
data, it is not displayed in Fig. [TTl Removing low 
overlap links did not reduce the largest component 
of the ERGMl networks practically at all - even af- 
ter removing 50 percent of links beginning with low- 
est overlap, a core containing on average 93.6 per- 
cent of the nodes remains intact - consistently with 
the finding that the networks did not contain many 
denser substructures such as cliques or fc-clusters. 

4-5. Differences in network structure resulting from 
choice of mechanisms for triadic closure and 
link deletion 

Here, we will examine the differences in net- 
work structure resulting from combinations of the 



TOSHK 
DEB 



lastfm 




Figure 11: (a) Relative size Rlc of the largest connected 
component in the models fitted to the lastfm data after re- 
moving links with overlap = 0. (b) To show that the 
breakdown of the nodal attribute models was not simply due 
to a larger number of links removed, we now remove the same 
fraction of the lowest overlap links from all models and data 
(41%, the maximum fraction removed in Fig. Illf a')'). Data 
averaged over 100 network realizations, (c and d) Schematic 
depiction of the structural differences related to links with 
low overlap (links whose end nodes share only a small frac- 
tion of their neighbors), (c) Low overlap links connect small, 
relatively tightly bound clusters together, (d) The network 
contains a core that does not disintegrate when low overlap 
links are removed. 



mechanisms of link generation (T1,T2) and deletion 
(ND,LD) emplyed in dynamical network evolution 
models. Taking as a starting point the simplest of 
the dynamical models (DEB), in which a newcomer 
will link to exactly two uniformly randomly chosen 
nodes, after which it will only initiate triadic closure 
steps, we study all four combinations of the mech- 
anisms (Fig. [121 a). Two findings speak in favor 
of using the node deletion mechanism: The model 
variants using Tl show a clearly assortative rela- 
tion, suitable for social network models, whereas 
the T2 networks are dissortative or very weakly as- 
sortative (Fig. [121 b). Node deletion also preserves 
more cliques in the network, a desirable feature for 
social networks (Fig. [121 c) . The larger number of 
cHques preserved by node deletion is not explained 
by the clustering coefficients, which turned out to 
be similar in all networks. The parameters were 
selected such that Nlc and k matched the lastfm 
data. 

The choice of triangle generation mechanism, on 
the other hand, is seen to affect the degree distribu- 
tion. Networks generated with the Tl mechanism 
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have higher degree nodes than those using the T2 
mechanism (Fig.[l2l d). This is because following a 
link is more likely to lead to a high degree node than 
picking a node randomly. Because in Tl both of the 
nodes gaining a link in the triad formation step are 
chosen by following a link, high degree nodes ob- 
tain more additional links than when the T2 mech- 
anism is used, in which one of the nodes is chosen 
randomly. The choice of Tl or T2 does not seem 
to have an effect on the number or size of cliques 
generated, nor on degree-degree correlations. 
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Figure 12: Comparison of mechanisms employed in dynami- 
cal network evolution models, (a) Two mechanisms of triadic 
closure (Tl and T2) are combined with two ways of deleting 
links (node deletion refers to deleting all links of a node, and 
link deletion refers to deleting randomly selected links). The 
same symbols are used in panels (b)-(d). (b) Average nearest 
neighbor degree with respect to node degree fc, variants 
arranged as in the schematic figure. The lastfm data is also 
shown in each panel, (c) Number n{k) of cliques of each 
size k. Smaller cliques within larger cliques are not counted, 
(d) Degree distribution P{k). Averaged over 100 network 
realizations. 



5. Summary and discussion 

In order to assess the resemblance to empirical 
networks of the many models for social networks 
that have been published in recent years in the 
physics-oriented complex networks literature, we 
have fitted these models to empirical data and as- 
sessed their structure. We have also compared these 
models with an exponential random graph model 
that incorporates recently proposed specifications, 
in the first systematic comparison between models 
from these families. In addition to comparing struc- 
tural features of networks produced by the models, 
we have discussed the different philosophies under- 
lying the model types. 



The structural features we focused on are sim- 
ilar in the two included empirical data sets as 
in nur nerous other large empirical socia l net- 
works (lOnnela et all. r2007bl: ISeshadri etlH . l2008l : 
Leskovec and Horvit j . 2008h in that they have 



highly skewed degree distributions, high average 
clustering coefficients, decreasing clustering spec- 
tra c(fc), and positive degree-degree-correlations r. 
Therefore, any widely applicable model for social 
networks should be able to approximately repro- 
duce the average values and distributions of their 
main characteristic features. However, as the phi- 
losophy behind the NEMs studied here is to explain 
the emergence of common structural features of so- 
cial networks, we shouldn't expect them to capture 
perfectly all features of particular empirical data 
sets. Our main motivation for fitting the models to 
the selected target features was to unify approxi- 
mately some of their properties, in order to compare 
meaningfully their higher order properties, such as 
the degree distribution and community structure. 
These are not likely to be drastically altered by 
small differences in the average values. Hence, we 
do not consider an accurate fit in the average quan- 
tities of extreme importance. 

For almost all models, we saw that average 
largest component size N^c and average degrees k 
could be fitted closely to both empirical data sets. 
In the ERGMl model, we compromized matching 
average degree in order to obtain a reasonable clus- 
tering coefficient. Adaptability was limited by the 
number of free parameters. The models DEB and 
Vaz, which had only one free parameter in addition 
to network size, turned out to have excessively high 
average clustering coefficients even for the moder- 
ate average degrees displayed by our two data sets. 
For most of the other models, clustering could be 
tuned rather closely. Being able to match the tar- 
geted average values of these two data sets does 
not guarantee that a model is able to match those 
features in other empirical data, however. In this 
sense, the generalisability of conclusions based on 
only two data sets is limited. 

Table [7| summarizes the structural features in 
networks resulting from the different model types. 
Nodal attribute models (NAMs) in which the nodes 
are located with uniform probability in the under- 
lying social space and links are based solely on ho- 
mophily, produce a clustering spectrum c(k) strik- 
ingly different from observed data, indicating that it 
is not a sufficient description of the mechanisms at 
play in the formation of social networks. They also 
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Table 7: Summary of structural properties of networks generated with the studied models. 



Property 


iastfm and email 


NAMs 


dynamical NEMs 


growing NEMs 


ERGMl 


degree distribution 


relatively broad 


peaked 


relatively broad 


broad 


relatively broad 


clustering spectrum 




fiat 


decreasing 


decreasing 


decreasing 


assortativity 


yes 


yes (high) 


yes (weak) 


yes (moderate/ high) 


yes (weak) 


geodesic path lengths 




in ID, too long 
longest paths 




reasonable 






many large cliques 


many large cliques 


many in KOSKK, 
fewest in MVS 


too few in TOSHK, 
exceedingly in Vaz 




fc-clusters 


many large fc-clusters 
for = 4 and fc = 5 




reasonable in DEB and 
KOSKK, too few in MVS 


in VAz, exceedingly large 
fc-clusters with large k 


no large fc-clusters 


consisting of dense 
clusters interconnected 
by low-overlap links 




yes 


yes (KOSKK), 

no (DEB and MVS) 




no 



produce peaked degree distributions without very 
high degree nodes that do no agree with empirical 
data on large scale social networks. The homophily 
principle employed in the nodal attribute models 
is seen to be sufficient for producing strong positive 
degree-degree correlations. This is a direct result of 
the dependence of link probability on distance: be- 
cause high degree nodes appear in locations with a 
dense population of nodes, their neighbors will also 
tend to have high degree. The NAMs also generate 
networks containing a large number of cliques and 
consisting of dense clusters loosely connected with 
low overlap links. Their clustered structure appears 
more pronounced than in the data. 

We find that many of the studied network evolu- 
tion models (NEMs) produce broader degree dis- 
tributions and decreasing clustering spectra that 
agree more closely with empirical data. Most of 
them also generate assortative networks, although 
typically not to the same extent as in the data, and 
many large cliques and /c-clusters. In the dynamical 
NEMs, node deletion is seen to produce more as- 
sortative networks than link deletion. With respect 
to thresholding by overlap, the dynamical KOSKK 
model displayed the clearest clustered structure of 
all the NEMs. This shows that the weights em- 
ployed in tie formation in the KOSKK model play 
an important role in the formatio n of community 
struc ture, as the authors observed i Kumpula et al.l 
20071 ). The other NEMs produced networks which, 
in accordance with the data, contained a large core 
that did not break apart when low overlap links 
were removed. 

The exponential random graph model ERGMl 
incorporating recently proposed terms for 



structural dependencies ( Sniiders et al. , 20061 : 



Hunter et al. . 2008t Robins et al.l . 2007bl ) was seen 



to generate very few large cHques. It did produce 
assortative networks, although with relatively 
low assortativity. These terms had earlier been 
employed without diffic ulty when f itting ERGMs to 
a large social network ( Goodreaul . 2007t ). However, 



we encountered problems of multimodality with 
the model. 

Very large social networks of millions of individu- 
als, within a country or worldwide, can be assessed 
with data provided by modern electr onic communi- 
cation s, such as mobile pho ne calls llOnnela et al 



2007ar) or instant messaging ijLeskovec and Horvitz 



20081). The data have revealed features of large 



scale networks of human interaction that could not 
be discerned from a small subnetwork. These in- 
clude the tails of highly skewed distributions as 
well as distributions of mesoscale structures, such as 
the size distribution of communities. Modeling the 
structure observed in large networks benefits from 
the ability to generate networks of comparable size. 
NEMs and NAMs fulfill this requirement. 

Using realistic models for social networks in sim- 
ulation studies of social processes is essential in 
light of the knowledge th at network struc t ure in - 



fiuences many processes ijCastellano et al.l . l2007n . 



such as the emerg e nce a nd survival of coopera- 
tion dhozano et al. , 2008lj , spreading of inform a- 



y tion llOnne a et al.l. I^UUYat IMoreno et al.| . l2UU4i l o] 

], epidemics ( na and Pastor-Satorrasl 2002 ), an d co- 



a et al.1. l2007at iMorenq et al 



200j) or 



20071) . 



existence of opinions (jLambiotte et al.l 

Many structural characteristics of social networks 
were attained even with very simple mechanisms. 
However, neither the nodal attribute models based 
on homophily, nor the network evolution models 
based on triadic closure and global connections, 
were able to reproduce all important features of so- 
cial networks. As both mechanisms obviously are 
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present in the evolution social networks, a combi- 
nation of the model types could yield more realistic 
network models. 
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A. Appendix 

A.l. Basic network measures 

The network representation of social contacts consists 
of nodes representing the individuals, and links repre- 
senting the ties between them. An overline is used to 
denote averaging over all nodes (or links) within the 
network, or across several networks. We denote by A'^ 
the number of nodes in a network, i.e. network size. A 
component of a network is a connected subset of nodes. 
In this paper, we study the largest component LC of 
each network. We denote its size by Nlc- The number 
of network neighbors of a node is called its degree k. An 
isolated node has degree zero. 

A measure of local triangle density, the clustering 
coefficient, describes the extent to which the neighbors 
of node i are acquainted with one another: if none on 
them know each other, Ci is zero, while if all of them 
are acquainted, Ci = 1. For a node i with degree ki 
and belonging to Ti triangles, the clustering coefficient 
is defined as 

(4) 



Ci = 



h{h - l)/2' 

where the denominator ki{ki — l)/2 expresses the max- 
imum possible number of triangles i could belong to 
given its degree. The clustering coefficient is not de- 
fined for nodes with degree k < 2. The average clus- 
tering coefficient, averaged over all nodes with k > 2 
in the network, is denoted c. c(k) denotes the average 
clustering coefficient of nodes with degree k. The curve 
c(fc) is called the clustering spectrum. 

In large empirical social networks, typically high de- 
gree nodes tend to be finked to other high-degree nodes, 
and low-degree nodes tend to be linked among them- 
selves. One way of quantifying this effect is using lin- 
ear correlation, or the Pearson correlation coefficient, 
between the degrees ki and kj of pairs of connected 
no des. Thi s is al so called the assortativity coefficient 
r l|Newmanl[2003 l: 

T..hk,/E~[j:^l{h + k,)Y /E^ 



where E is the total number of links in the network. 
Assortativity can also be quantified using the measure 
average nearest neighbor degree fc„„(fc), found by taking 
all nodes with degree fc, and averaging the degrees of 
their neighbors. If the curve knn{k) plotted against k 
has a positive trend, nodes with high degree typically 
also have high-degree neighbors, and hence the network 
is assortative. 

The geodesic path length hj between nodes i and 
j in a network means the minimum number of links 
that need to be traversed in order to get from i to j. 
The average length I of geodesic paths between nodes 
describes the compactness of the network. 

A. 2. Determining optimal network parameters 

Our fitting method consists of simulating network re- 
alizations with different values of the model parameters, 
and finding the values (points in the parameter space) 
that produce the best match to the following features of 
the empirical data sets: average degree k, average clus- 
tering coefficient c, and average geodesic path lengths I 
(in this order of importance, depending on the number 
of model parameters). This approach deviates from the 
tradition of maximum likelihood estimation for fitting 
probabilistic models. 

We attempt to minimize the relative error in each 
chosen feature. For example, for average degree fc in a 
model with given parameter values p, being fitted to a 

■ T T rr-target 

data set with average degree fc , the relative error 



fc(p) 



rr-target 

■ k 



k 



■target 



(5) 



The errors for each feature are combined in the error 
function /(p), whose norm |/(p)| is minimized. For 
example, if fitting to Nlc, k and c, the error function 
and its norm take the shape 

f{p) = [wnlc'^Nlc "^k^k w^ec], (6) 
and its norm 



l/(p)l = 



(7) 



E 



J(fc? + fe|)/i5- 



[T..h{h + ki)Y /E^ 



The error function should have equally many com- 
ponents as there are network parameters. We chose 
weights that refiected the order of importance given 
to the targeted features, putting the most emphasis 
on matching the number of nodes and links, less on 
clustering, and least on average geodesic path lengths. 
It turned out that for nearly all of the models (DEB, 
MVS, KOSKK, Vaz, BPDA, email fit of WPR) the 
result was insensitive to weights, because the models 
were able to match the target values closely (up to the 
number of model parameters). In optimizing the mod- 
els DEB and KOSKK, we used a linear approximation 
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for the components of the error function, iteratively re- 
fining the approximation close to the optimum. For 
MVS, w e used the the w e ll est ablished Nelder-Mead 
method jNelder and Mecidl . Il965l l. which involves cal- 
culating values of the error function at the corners of 
a simplex (a triangle in 2-dimensional space, a tetrahe- 
dron in 3D). The optimal value of the error function is 
iteratively approached by rolling one corner of the sim- 
plex over the others such that the object moves towards 
the region where the function gets optimal values. The 
diameter of the simplex is adjusted during iteration to 
increase accuracy. 

Optimization algorithms were not needed for the Vaz 
and BPDA models and the email fit of WPR. For the 
Vaz model, a very good approximation for the optimal 
value of u can be obtained analytically. This estimate 
could be refined manually. For the BPDA model, the 
analytical estimates for k and c derived by the authors 
could be used as a starting point in optimization. We 
refined the initial estimates by first adjusting a to find 
the correct value for the clustering coefficient, and then 
changing b until the correct mean degree was found. 
For small enough adjustments, the latter corrections 
did not affect the value of c. The adjustments were 
done by trial and error, but it was not difficult to get 
an accurate fit for mean degree and clustering in this 
manner. For the email fit of WPR, it turned out that 
Nlc ~ N, and hence the number of free parameters 
was reduced, p was set to obtain desired average de- 
gree, and the two remaining parameters were optimized 
by generating networks with a grid of their values. 

Exact fits could not be obtained for TOSHK, 
ERGMl, and the lastfm fit of WPR. For WPR, we 
used weights [uuNi^cj^kjWcWi] = [4 4 2 1] and grid 
optimization similarly as in the email case, although it 
was costly in four dimensions. Obtaining values in a 
grid enabled us to visulize the dependence of the tar- 
geted features on the model parameters. It turned out 
that assortativity and clustering varied closely together, 
rendering assortativity useless as a fitting target if clus- 
tering was used; hence we used average geodesic path 
lengths, which enabled an optimum to be determined. 
As the TOSHK model has only one continuous param- 
eter p, it suffices to optimize p for all values of the dis- 
crete parameter k below some fcmax, making sure that 
fcmax is large enough. The parameter p was optimized 
to reach the desired mean degree for each k, and the 
pair (fc,p°''*(fc)) that provided the best match to the 
desired c was selected as the optimum. Optimization 
was carried out with the Matlab optimization toolbox 
function fminbnd.m, which is based on golden section 
search and parabolic interpolation. 

For the remaining case in which no exact match was 
found (ERGMl), we attempted using the linear approx- 
imation method and Nelder-Mead algorithm described 
above, as well as other, potentially more robust meth- 



ods jElster and NeumaieJ . Il995l : iHuver and Neumaierl . 
l2008l ). but these failed likely due to multimodality of 
the probability distribution. Figure [13] illustrates the 
instability we encountered when attempting to fit the 
ERGMl model to the email data. The panels display 
average degree k (a) and average clustering coefficient 
c (b) in networks generated with various values of 9l, 
with the other parameters kept constant at the values 
listed in Table [5] For each value of 6l, 60 network re- 
alizations are shown (drawn from MCMC chains with 
burn-in 5 X 10^ steps, and 5 realizations taken from 
each chain at intervals of 10^). Because 9l controls the 
number of random links, an increase in ^z, generally in- 
creases average degree and decreases average clustering. 
However, at roughly 9l = —6.961 we observe a sudden 
transition into a much denser, less clustered network. 
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Figure 1.3: (a) Average degree k and (b) average clustering 
coefRcient c in networks generated from the model ERGMl 
with different values of 6^ . 
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